Skip to main content

Profiling Python Applications

The Three-Day Mistake

A senior engineer at a data company spent three days rewriting a text preprocessing function in Cython. The function cleaned and tokenised incoming customer support tickets before they were fed into a classifier. He had read that string operations were slow in Python and was convinced the tokeniser was the problem.

After the rewrite, he ran the service under load. The p99 latency dropped from 340ms to... 332ms.

Confused, he finally did what he should have done on day one: ran a profiler.

Profiling output (sorted by tottime):

ncalls tottime percall cumtime percall filename:lineno(function)
2000 0.003 0.000 0.003 0.000 tokeniser.pyx:12(tokenise) ← his work
2000 0.007 0.000 0.007 0.000 classifier.py:45(predict)
2000 0.298 0.000 0.298 0.000 audit.py:89(log_ticket) ← !!
2000 0.012 0.000 0.012 0.000 response.py:23(format)

The log_ticket function was writing to a file using a regex to redact PII from every log line. The regex was compiled fresh on every call. The tokeniser he spent three days on? It consumed 0.3% of total runtime. The regex in the audit logger consumed 89%.

The fix: re.compile() at module level, called once. Time: 20 minutes.

This is not an unusual story. It is the default story for engineers who skip profiling.

What You Will Learn

  • Use cProfile and pstats to find CPU bottlenecks in any Python program
  • Use line_profiler to identify the exact slow line inside a function
  • Use py-spy to profile running production processes without modifying code
  • Use memory_profiler and tracemalloc to find memory bottlenecks and leaks
  • Use timeit correctly for micro-benchmarks
  • Build a profiling-first discipline that prevents the three-day mistake

Prerequisites

RequirementLevel Needed
Python functions and decoratorsComfortable
Command-line basicsComfortable
pip package installationComfortable
asyncio basics (Lesson 05 only)Helpful

The Python Performance Measurement Stack

Choose the right tool before reaching for any of them. Each tool has a specific use case and significant blind spots.

ToolTypeOverheadUse CaseCode Changes?
timeitTimerNoneMicro-benchmarks for small code snippetsYes
cProfileDeterministicMediumFinding which functions are slow, call countsYes (or not)
line_profilerDeterministicHighFinding which lines inside a function are slowYes (decorator)
py-spySampling~1%Profiling running processes, production safeNo
pyinstrumentSamplingLowDevelopment profiling, beautiful outputYes or no
AustinSampling<1%Ultra-low overhead, SVG flamegraphsNo
memory_profilerDeterministicVery highLine-by-line memory usageYes (decorator)
tracemallocDeterministicMediumAllocation tracking, leak detectionYes
objgraphSnapshotLowFinding leaked object types and referencesYes

Decision rule:

  • Unknown bottleneck → cProfile first
  • Found the function, need the line → line_profiler
  • Running production process → py-spy
  • Memory growing → tracemalloc + objgraph
  • Memory by line → memory_profiler
  • Micro-benchmark → timeit

Section 1: cProfile and pstats

cProfile is the standard starting point. It instruments every function call - recording how many times each function was called and how long each call took. It has medium overhead (typically 2–5x slowdown) and is safe to use in development.

Basic Usage

import cProfile

def slow_function():
"""A realistic function with a hidden bottleneck."""
import re
results = []
data = ["[email protected] ticket #1234 URGENT"] * 10_000

for line in data:
# Bug: compiling regex inside the loop - 10,000 compilations
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')
emails = pattern.findall(line)
results.append(emails)
return results

# Method 1: simple one-liner
cProfile.run('slow_function()')

Output:

30006 function calls in 0.847 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.847 0.847 <string>:1(<module>)
10000 0.031 0.000 0.031 0.000 re.py:232(compile)
10000 0.782 0.000 0.782 0.000 re.py:289(_compile) ← HERE
10000 0.023 0.000 0.023 0.000 {method 'findall'}
1 0.011 0.011 0.847 0.847 example.py:4(slow_function)

The column meanings:

ColumnMeaning
ncallsNumber of times this function was called
tottimeTotal time spent in this function, excluding time in sub-calls
percalltottime / ncalls
cumtimeTotal time spent in this function including all sub-calls
percallcumtime / ncalls (second percall column)

Key insight: tottime is the most useful column. It tells you how much time this function itself consumed, without attributing time to functions it called. A function with high cumtime but low tottime is slow because of what it calls, not what it does itself.

cProfile.Profile() as Context Manager

import cProfile
import pstats
import io

def profile_block(sort_by: str = 'tottime', limit: int = 20):
"""Context manager for profiling a block of code."""
import contextlib

@contextlib.contextmanager
def _profiler():
pr = cProfile.Profile()
pr.enable()
try:
yield pr
finally:
pr.disable()
s = io.StringIO()
ps = pstats.Stats(pr, stream=s)
ps.strip_dirs()
ps.sort_stats(sort_by)
ps.print_stats(limit)
print(s.getvalue())

return _profiler()

# Usage
with profile_block(sort_by='tottime', limit=10):
slow_function()

pstats for Filtering and Sorting

import cProfile
import pstats
import io

# Profile to a file
cProfile.run('slow_function()', 'profile_output.prof')

# Load and analyse the saved profile
with open('profile_output.prof', 'rb') as f:
stats = pstats.Stats('profile_output.prof')

# Sort options: 'tottime', 'cumtime', 'ncalls', 'pcalls', 'name', 'filename'
stats.sort_stats('tottime')

# Print only functions that match a filename pattern
stats.print_stats('re.py') # only regex-related functions
stats.print_stats(0.1) # only top 10% of functions by count
stats.print_stats(20) # top 20 functions

# Show callers of a specific function
stats.print_callers('_compile')

# Show what a function calls
stats.print_callees('slow_function')

snakeviz - Visual Flamegraphs

cProfile output is hard to read at scale. snakeviz renders the profile as an interactive sunburst or icicle chart in your browser:

pip install snakeviz

# Profile to a file first
python -m cProfile -o myapp.prof myapp.py

# Open the visual profile
snakeviz myapp.prof

For production profiling, save profiles per-request and attach them to slow-request logs:

import cProfile
import pstats
import io
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def profile_if_slow(threshold_ms: float = 200.0):
"""
Decorator: profile the function if it takes longer than threshold_ms.
Logs the profile output without affecting normal execution.
"""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
pr = cProfile.Profile()
pr.enable()
try:
result = func(*args, **kwargs)
finally:
pr.disable()
s = io.StringIO()
ps = pstats.Stats(pr, stream=s)
ps.strip_dirs()
ps.sort_stats('tottime')
ps.print_stats(15)
profile_text = s.getvalue()

elapsed_ms = sum(
row[2] for row in pr.getstats()
if row[0] == func.__code__
) * 1000

if elapsed_ms > threshold_ms:
logger.warning(
"Slow function detected: %s took %.1fms\n%s",
func.__name__, elapsed_ms, profile_text
)
return result
return wrapper
return decorator

@profile_if_slow(threshold_ms=100.0)
def process_document(doc_id: str) -> dict:
# ... production code ...
pass

Section 2: line_profiler - Finding the Slow Line

cProfile tells you which function is slow. line_profiler tells you which line inside that function is slow. This distinction matters enormously - a 100-line function might have its entire cost on line 47.

pip install line-profiler

Basic Usage

# Decorate the function you want to profile
# The @profile decorator is injected by kernprof - do NOT import it
@profile
def preprocess_batch(documents: list[dict]) -> list[dict]:
"""Realistic ML preprocessing pipeline."""
results = []

for doc in documents:
# Step 1: lowercase and strip
text = doc['content'].lower().strip()

# Step 2: remove punctuation
import string
text = text.translate(str.maketrans('', '', string.punctuation))

# Step 3: tokenise by whitespace
tokens = text.split()

# Step 4: remove stopwords (loaded fresh each time - the bug)
stopwords = set(open('/usr/share/dict/stopwords').read().split())
tokens = [t for t in tokens if t not in stopwords]

# Step 5: build bigrams
bigrams = [f"{tokens[i]}_{tokens[i+1]}"
for i in range(len(tokens)-1)]

results.append({
'id': doc['id'],
'tokens': tokens,
'bigrams': bigrams,
})

return results

Run with:

kernprof -l -v preprocess_example.py

Output:

Wrote profile results to preprocess_example.py.lprof

Timer unit: 1e-06 s

Total time: 8.421 s
File: preprocess_example.py
Function: preprocess_batch at line 4

Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 @profile
5 def preprocess_batch(documents):
8 5000 12304.0 2.5 0.1 for doc in documents:
11 5000 45231.0 9.0 0.5 text = doc['content'].lower().strip()
14 5000 23187.0 4.6 0.3 text = text.translate(...)
17 5000 18432.0 3.7 0.2 tokens = text.split()
20 5000 7891234.0 1578.2 93.7 stopwords = set(open(...).read().split())
21 5000 89234.0 17.8 1.1 tokens = [t for t in tokens...]
24 5000 78234.0 15.6 0.9 bigrams = [...]

Line 20 consumes 93.7% of the total time. It opens and reads a file on every iteration of the loop. The fix:

# Load stopwords ONCE, outside the loop
STOPWORDS = set(open('/usr/share/dict/stopwords').read().split())

@profile
def preprocess_batch_fixed(documents: list[dict]) -> list[dict]:
results = []
for doc in documents:
text = doc['content'].lower().strip()
text = text.translate(str.maketrans('', '', string.punctuation))
tokens = text.split()
tokens = [t for t in tokens if t not in STOPWORDS] # uses module-level set
bigrams = [f"{tokens[i]}_{tokens[i+1]}" for i in range(len(tokens)-1)]
results.append({'id': doc['id'], 'tokens': tokens, 'bigrams': bigrams})
return results

After the fix: 0.131 s. 64x speedup, from a one-line change that required no Cython, no Numba, no C extensions - just moving file I/O out of the loop.

Programmatic Usage

from line_profiler import LineProfiler

def add_profile(func):
"""Add line-level profiling to a function programmatically."""
lp = LineProfiler()
lp_wrapper = lp(func)

def run(*args, **kwargs):
result = lp_wrapper(*args, **kwargs)
lp.print_stats()
return result

return run

Section 3: py-spy - The Production-Safe Profiler

cProfile and line_profiler require code changes. They also add significant overhead - 2–5x for cProfile, up to 50x for line_profiler. You cannot use them on a production server under load without impacting users.

py-spy is a sampling profiler. It attaches to a running Python process, reads its call stack at regular intervals (100 times per second by default), and produces a statistical profile. Overhead is approximately 1%. No code changes required.

pip install py-spy
# or: cargo install py-spy (if you have Rust)

Profiling a Running Process

# Find your process ID
ps aux | grep python

# Show a live top-like view (updates every second)
py-spy top --pid 12345

# Record a flamegraph SVG
py-spy record --output profile.svg --pid 12345 --duration 30

# Dump the current call stack (one snapshot)
py-spy dump --pid 12345

The py-spy top output looks like:

Collecting samples from 'python app.py' (pid: 12345)
Total Samples 1200
GIL: 34.00%, Active: 78.00%, Threads: 4

%Own %Total OwnTime TotalTime Function (filename)
23.00 23.00 2.76s 2.76s log_ticket (audit.py:89)
18.00 18.00 2.16s 2.16s _compile (re.py:289)
12.00 35.00 1.44s 4.20s handle_request (api.py:145)
9.00 9.00 1.08s 1.08s json_loads (json/decoder.py:337)

Profiling a FastAPI Service Under Load

Typical workflow for profiling a production-like service:

# Terminal 1: start the service
uvicorn myapp:app --workers 1 --port 8000

# Terminal 2: generate load
hey -n 10000 -c 50 http://localhost:8000/api/process

# Terminal 3: while load is running, profile it
py-spy record --output flamegraph.svg \
--pid $(pgrep -f "uvicorn myapp") \
--duration 20 \
--rate 100

# Open the flamegraph
open flamegraph.svg

Reading Flamegraphs

A flamegraph is a stack trace visualisation:

Width = time spent in function (wider = slower)
Height = call depth (bottom = entry point, top = leaf)
Colour = arbitrary (no performance meaning)

┌─ regex._compile ─────────────────┐
┌─ audit.log_ticket ────────────────────────────────────┐
┌─ fastapi.handle ─────────────────────────────────────────────────────────┐
│████████████████████████████████████████████████████████████████████████│

Look for the widest bars at the top of the stack - those are the leaf functions where the most time is being spent.

Profiling a Kubernetes Pod

# Find the pod
kubectl get pods -l app=myservice

# Exec into the pod and run py-spy
kubectl exec -it myservice-pod-abc123 -- \
py-spy record \
--output /tmp/profile.svg \
--pid 1 \
--duration 30 \
--subprocesses

# Copy the profile out
kubectl cp myservice-pod-abc123:/tmp/profile.svg ./profile.svg

Note: py-spy requires the SYS_PTRACE capability in Kubernetes. Add to your pod spec:

securityContext:
capabilities:
add:
- SYS_PTRACE

Why Sampling Profilers Are Safer

Deterministic profilers (like cProfile) record every function call. They cannot miss any call, but they impose consistent overhead throughout the profiling session. At high request rates this overhead can cause request timeouts and queue buildup.

Sampling profilers read the call stack at intervals. They can miss functions that complete between samples (typically fast utility functions), but they accurately represent where time is spent - which is what you want for finding bottlenecks. The statistical error is bounded and predictable.

For production use, always prefer a sampling profiler.

Section 4: memory_profiler - Line-by-Line Memory

CPU profiling finds slow functions. Memory profiling finds functions that allocate too much. These are different problems with different fixes.

pip install memory-profiler matplotlib

Basic Usage

from memory_profiler import profile

@profile
def build_inverted_index(documents: list[dict]) -> dict:
"""Build an inverted index - a common memory-intensive operation."""
# Step 1: collect all tokens
all_tokens = [] # potential memory issue
for doc in documents:
tokens = doc['content'].split()
all_tokens.extend(tokens) # extends by copying

# Step 2: build the index
index = {}
for doc_id, doc in enumerate(documents):
tokens = set(doc['content'].split())
for token in tokens:
if token not in index:
index[token] = []
index[token].append(doc_id)

return index

Run with:

python -m memory_profiler memory_example.py

Output:

Filename: memory_example.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
4 52.3 MiB 52.3 MiB 1 @profile
5 def build_inverted_index(documents):
8 52.3 MiB 0.0 MiB 1 all_tokens = []
9 52.3 MiB 0.0 MiB 50000 for doc in documents:
10 52.3 MiB 0.0 MiB 50000 tokens = doc['content'].split()
11 412.7 MiB 360.4 MiB 50000 all_tokens.extend(tokens) ← !!
14 413.1 MiB 0.4 MiB 1 index = {}
15 448.2 MiB 35.1 MiB 50000 for doc_id, doc in enumerate(documents):

Line 11 allocates 360 MB because all_tokens accumulates a copy of every token from every document. If you only need the index (not the flat list), all_tokens is unnecessary entirely.

Fixed version:

@profile
def build_inverted_index_fixed(documents: list[dict]) -> dict:
"""Build inverted index without the intermediate flat list."""
index = {}
for doc_id, doc in enumerate(documents):
for token in set(doc['content'].split()):
index.setdefault(token, []).append(doc_id)
return index

Memory output for fixed version:

Line # Mem usage Increment Line Contents
=================================================
4 52.3 MiB 52.3 MiB @profile
5 def build_inverted_index_fixed(documents):
6 52.3 MiB 0.0 MiB index = {}
7 52.3 MiB 0.0 MiB for doc_id, doc in enumerate(documents):
8 87.4 MiB 35.1 MiB for token in set(doc['content'].split()):
9 87.4 MiB 0.0 MiB index.setdefault(token, []).append(doc_id)
10 87.4 MiB 0.0 MiB return index

Peak usage: 87 MB instead of 448 MB. A 5x reduction from a 3-line change.

mprof - Memory Over Time

# Record memory usage over time
mprof run python memory_example.py

# Plot it
mprof plot

# Plot the most recent run
mprof plot --output memory_profile.png

This produces a time-series plot showing memory usage. Useful for identifying memory leaks in long-running processes: if memory grows monotonically over time, you have a leak.

Section 5: tracemalloc - Allocation Tracking Without Dependencies

tracemalloc is built into Python's standard library (Python 3.4+). It requires no external packages and is the right tool when you cannot install third-party packages (restricted environments, containers without pip access).

import tracemalloc
import linecache

def display_top_allocations(snapshot, limit: int = 10) -> None:
"""Display top memory-allocating lines from a tracemalloc snapshot."""
top_stats = snapshot.statistics('lineno')

print(f"\nTop {limit} memory allocations:")
print("=" * 70)
for index, stat in enumerate(top_stats[:limit], 1):
frame = stat.traceback[0]
print(f"\n#{index}: {frame.filename}:{frame.lineno}")

# Show the actual source line
line = linecache.getline(frame.filename, frame.lineno).strip()
if line:
print(f" {line}")
print(f" {stat.size / 1024:.1f} KiB - {stat.count} allocations")


def find_memory_bottleneck():
"""Demonstrate tracemalloc for finding top allocations."""
tracemalloc.start(25) # keep 25 frames of stack trace

# === Your code under test ===
import json
large_data = [{"id": i, "value": "x" * 1000} for i in range(10_000)]
serialised = json.dumps(large_data)
parsed = json.loads(serialised)
# === End code under test ===

snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

display_top_allocations(snapshot, limit=10)


find_memory_bottleneck()

Snapshot Diffing - Finding Leaks in Long-Running Processes

The most powerful tracemalloc technique: take two snapshots separated in time and look at what grew. This isolates memory that was allocated but never freed - i.e., leaks.

import tracemalloc
import time
import gc
from typing import Optional

class LeakDetector:
"""
Production-grade leak detector using tracemalloc snapshots.
Take a baseline snapshot, then compare against it periodically.
"""

def __init__(self, n_frames: int = 25):
self.n_frames = n_frames
self._baseline: Optional[tracemalloc.Snapshot] = None

def start(self) -> None:
tracemalloc.start(self.n_frames)

def take_baseline(self) -> None:
"""Call this after the process has warmed up."""
gc.collect()
self._baseline = tracemalloc.take_snapshot()
print("Baseline snapshot taken.")

def report_growth(self, top_n: int = 15) -> None:
"""Compare current state to baseline, show what grew."""
if self._baseline is None:
raise RuntimeError("Call take_baseline() first.")

gc.collect()
current = tracemalloc.take_snapshot()
stats = current.compare_to(self._baseline, 'lineno')

print(f"\nMemory growth since baseline (top {top_n}):")
print("=" * 70)
for stat in stats[:top_n]:
if stat.size_diff > 0: # only show growth, not shrinkage
frame = stat.traceback[0]
print(
f"{frame.filename}:{frame.lineno} "
f"+{stat.size_diff / 1024:.1f} KiB "
f"({stat.count_diff:+d} objects)"
)


# Example: detect a leak in a request handler
detector = LeakDetector()
detector.start()

# Simulate service warm-up
for _ in range(100):
process_request_warmup()

detector.take_baseline()

# Simulate production traffic
for i in range(1000):
process_request()
if i % 100 == 0:
detector.report_growth(top_n=5)

Finding Where an Object Was Allocated

import tracemalloc

tracemalloc.start(50) # deep stack traces

# ... run code ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('traceback')

# Show full stack trace for the largest single allocation
stat = top_stats[0]
print(f"Largest allocation: {stat.size / 1024:.1f} KiB")
print("Stack trace:")
for line in stat.traceback.format():
print(f" {line}")

Section 6: timeit - Correct Micro-Benchmarks

timeit is the right tool for comparing small code snippets. It is routinely misused - here is how to use it correctly.

The Wrong Way

import time

# WRONG: single run, includes interpreter startup,
# does not account for JIT warming, not reproducible
start = time.time()
result = sum(x**2 for x in range(100_000))
print(f"Took: {time.time() - start:.6f}s")

Problems:

  1. Single measurement - high variance from OS scheduling, cache state
  2. time.time() has lower resolution than time.perf_counter()
  3. Does not account for Python startup cost
  4. Does not separate setup from the thing being measured

The Right Way

import timeit
import statistics

def honest_benchmark(
stmt: str,
setup: str = "pass",
n_runs: int = 7,
loops_per_run: int = 10_000,
globals_dict: dict = None,
) -> dict:
"""
Honest benchmark following best practices:
- Multiple independent runs (n_runs)
- Multiple loops per run (loops_per_run)
- Reports median, not mean
- Separates setup from the timed statement
"""
timer = timeit.Timer(stmt=stmt, setup=setup, globals=globals_dict)

# timeit returns total time for all loops in one run
raw_times = timer.repeat(repeat=n_runs, number=loops_per_run)

# Convert to per-iteration time
per_iter = [t / loops_per_run for t in raw_times]

return {
"median_ns": statistics.median(per_iter) * 1e9,
"mean_ns": statistics.mean(per_iter) * 1e9,
"stdev_ns": statistics.stdev(per_iter) * 1e9,
"min_ns": min(per_iter) * 1e9,
"max_ns": max(per_iter) * 1e9,
}


# Compare two approaches for building a string
result = honest_benchmark(
stmt='"".join(parts)',
setup='parts = ["word"] * 1000',
n_runs=10,
loops_per_run=50_000,
)
print(f"join: {result['median_ns']:.1f} ns/iter (±{result['stdev_ns']:.1f})")

result = honest_benchmark(
stmt='s = ""; [s := s + p for p in parts]', # quadratic string concat
setup='parts = ["word"] * 1000',
n_runs=10,
loops_per_run=1_000, # fewer loops - this is much slower
)
print(f"concat: {result['median_ns']:.1f} ns/iter (±{result['stdev_ns']:.1f})")

%timeit in Jupyter

In a Jupyter notebook, %timeit is the cleanest interface:

import re

text = "Contact us at [email protected] or [email protected] for help"

# Compile regex once (should be faster)
pattern = re.compile(r'\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b')

%timeit re.findall(r'\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b', text)
# Output: 3.21 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit pattern.findall(text)
# Output: 1.87 µs ± 34.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit (double %) times an entire cell.

Common timeit Mistakes

MistakeProblemFix
n=1 (single run)High variance, unreliableUse repeat=7, number=10000
Including setup in the stmtMeasuring setup, not the functionUse setup= parameter
Using wall time with time.time()OS scheduling noise includedUse time.perf_counter()
Reporting mean instead of medianOutliers skew meanReport median
Not importing in setupImport overhead included in measurementPut imports in setup=
Comparing different data sizesNot an apples-to-apples comparisonHold data size constant

Section 7: Profiling in Production

Development profiling and production profiling are different disciplines.

Continuous Profiling with py-spy

Run py-spy as a sidecar or init container alongside your application:

# Dockerfile - add py-spy to the image
FROM python:3.12-slim

RUN pip install py-spy

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . /app
WORKDIR /app

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# kubernetes-sidecar.yaml - py-spy as a sidecar
apiVersion: v1
kind: Pod
spec:
containers:
- name: app
image: myapp:latest
securityContext:
capabilities:
add: ["SYS_PTRACE"]

- name: profiler
image: myapp:latest # same image has py-spy
command:
- py-spy
- record
- --output
- /profiles/$(date +%s).svg
- --pid
- "1"
- --duration
- "60"
- --rate
- "50"
volumeMounts:
- name: profiles
mountPath: /profiles

pyinstrument for Development

pyinstrument is a sampling profiler with beautiful terminal output. Better than cProfile for quick development profiling:

pip install pyinstrument

# Profile a script
pyinstrument my_script.py

# Profile a specific function in code
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# ... your code ...
result = process_documents(docs)

profiler.stop()
profiler.print() # beautiful terminal output
profiler.open_in_browser() # interactive HTML report

Pyroscope - Always-On Profiling

For always-on production profiling, pyroscope pushes profiles to a central server continuously:

import pyroscope

pyroscope.configure(
application_name="myservice",
server_address="http://pyroscope:4040",
tags={
"env": "production",
"version": os.environ.get("APP_VERSION", "unknown"),
},
)

This gives you 24/7 flamegraph visibility without manually triggering profiling sessions.

Section 8: The Profiling Discipline - A Workflow

Combine all the tools into a repeatable workflow:

Step 1: CPU Profile (cProfile)
─────────────────────────────
python -m cProfile -o profile.prof -s tottime myapp.py
snakeviz profile.prof

→ Find the top 3 functions by tottime
→ Note their ncalls and cumtime

Step 2: Line Profile the Hot Function (line_profiler)
──────────────────────────────────────────────────────
Add @profile to the top 3 functions
kernprof -l -v myapp.py

→ Find the exact line responsible
→ Form a hypothesis about WHY it is slow

Step 3: Memory Profile if RSS is High (tracemalloc)
────────────────────────────────────────────────────
Add tracemalloc.start() at entry point
Take snapshot before and after the operation
compare_to() to find growth

→ Find which allocations are growing
→ Check if they are released (GC'd) or retained (leak)

Step 4: Benchmark the Fix (timeit)
────────────────────────────────────
Before changing anything: timeit the slow line
After the fix: timeit the new version
Report speedup with median and stdev

Step 5: Verify Under Production Load (py-spy)
──────────────────────────────────────────────
Deploy to staging
Run load test: hey -n 50000 -c 100 http://staging/endpoint
py-spy record --pid $(pgrep myapp) --duration 30

→ Confirm the previously hot function is now fast
→ Check no new bottleneck appeared

Red-Yellow-Green Triage

When reviewing a cProfile output, triage functions into three buckets:

ColourCriterionAction
Redtottime > 10% of total runtimeProfile with line_profiler, fix now
Yellowtottime 2–10% of total runtime, ncalls > 10kInvestigate - may be worth fixing
Greentottime < 2% of total runtimeIgnore completely

Most programs have 1–3 Red functions. Fix those. Ignore everything else.

Interview Questions

Q1: What is the difference between tottime and cumtime in cProfile output? Which is more useful for identifying bottlenecks?

tottime is the total time spent executing the function's own code, excluding the time spent in functions it called. cumtime is the total time including all recursive calls to sub-functions.

tottime is almost always more useful for bottleneck identification. A function with high cumtime but low tottime is slow because of what it calls - not what it does itself. To fix that function you need to look at its callees. A function with high tottime is itself the bottleneck and is where you direct line_profiler. The exception is when you want to find the highest-level entry point into a slow call tree - then cumtime is the right column to sort by.

Q2: Why is py-spy safer to use in production than cProfile? What are its limitations?

py-spy is a sampling profiler - it reads the process's call stack at intervals (100 samples/second by default) without modifying the process. It attaches via ptrace (Linux) or OS-specific process inspection APIs. Overhead is approximately 1%.

cProfile is a deterministic profiler that instruments every function entry and exit via Python's sys.setprofile() hook. This adds overhead to every function call - typically 2–5x. At high request rates (e.g., 1000 req/s), this overhead can push response times over timeout thresholds and cause cascading failures.

Limitations of py-spy:

  • Cannot time functions that are too fast to be caught by sampling (< 1ms duration, rarely called)
  • Requires SYS_PTRACE capability in containerised environments
  • Cannot be used for micro-benchmarking (not deterministic enough)
  • Line-level precision is limited compared to line_profiler

Q3: You suspect your FastAPI service has a memory leak. Describe exactly how you would find it using tracemalloc.

First I would enable tracemalloc at application startup with a sufficient frame depth: tracemalloc.start(25). Then I would let the service warm up (handle 100–200 requests to fill caches and pre-allocate common objects). After warm-up I call gc.collect() and take a baseline snapshot: baseline = tracemalloc.take_snapshot().

I then let the service run for 10–30 minutes under normal traffic. After the interval, I call gc.collect() again and take a second snapshot. I compare the two: stats = current.compare_to(baseline, 'lineno'), and filter for stat.size_diff > 0 (only allocations that grew). The top entries by size_diff are the candidates for the leak.

I then look at the traceback attribute of the top stat to get the full stack trace showing where the allocation originated. This is usually sufficient to identify whether it is a growing cache, an unreleased callback, a circular reference preventing GC, or a container being appended to without being cleared.

Q4: What is a flamegraph and how do you read one? What does it tell you that cProfile output does not?

A flamegraph is a 2D visualisation of call stacks sampled over time. The x-axis represents time (or fraction of samples), and width of each bar represents how much time was spent in that function. The y-axis represents stack depth - the bottom bar is the entry point, each row above shows functions called by the row below. Colour is arbitrary.

To read it: look for wide bars near the top of the stack. Those are leaf functions (no more callees) where actual time is being spent. A wide bar in the middle of the stack means a function spends most of its time in a single callee - follow the stack upward to find the real work.

What flamegraphs show that cProfile does not:

  1. Context: you see which call path leads to a slow function - e.g., handle_request → validate_input → regex_compile vs handle_request → load_user → regex_compile. Both show up as regex_compile in cProfile's flat table, but the flamegraph shows they are in different code paths.
  2. Proportion: the visual width immediately communicates relative cost without scanning columns.
  3. Real-time observation: generated from a sampling profiler attached to a live process, not a special profiling run.

Q5: An engineer says "I profiled my service and found that json.loads consumes 40% of CPU. Should I rewrite the JSON parsing in C?"

Not as the first step. There are several things to investigate before rewriting anything:

  1. Can you reduce the amount of JSON being parsed? If you are parsing the same document multiple times per request (e.g., parsing a config file on every request instead of once at startup), eliminate the redundant calls first.

  2. Can you use a faster JSON library? orjson is a Rust-based JSON library that is 3–10x faster than stdlib json for most workloads. It is a one-line change: import orjson; data = orjson.loads(raw). msgspec is another option. Test both - the right answer depends on your data shape.

  3. Can you reduce the payload size? If you are parsing large JSON objects but only using 10% of the fields, consider a streaming parser or a schema-aware deserialiser that skips unused fields.

  4. Can you cache the parsed result? If the same JSON content is parsed multiple times (common with configuration, schema objects, or repeated API responses), cache with functools.lru_cache or a TTL cache.

Only if none of these approaches is available should you consider writing a C extension - and at that point, orjson already is a C/Rust extension, so "rewriting in C" likely means investing in a domain-specific binary format like msgpack, protobuf, or flatbuffers instead of JSON.

The lesson: a 40% cost in json.loads is almost always addressable without writing C code.

© 2026 EngineersOfAI. All rights reserved.